Conversation
Removed commented-out SFTDataset class and its methods.
@JasonCZH4 Thank you very much for submitting a PR to this project! I think adding the sequence packing feature is a great optimization that can significantly improve GPU utilization and training efficiency. After reviewing the newly added code, I have the following suggestions:
Finally, thank you again for contributing to this project!
Thanks for reviewing! This is my first time submitting a PR. There are still several issues with the current code, and I am working hard to fix them. In fact, I am currently running many tests. Once everything is OK, I will let you know. Thanks again!
Thank you very much for your efforts! This is also BumbleCore's first PR, so I attach great importance to it. Wish you all the best!
wxhcore left a comment
Thank you for your contribution! We still have a few issues to address. Additionally, have you compared the training performance of the same model and training configuration with and without sequence packing, to ensure that samples are properly isolated?
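As context for the isolation question above, here is a minimal, framework-free sketch of how sequence packing typically keeps samples separate (all function names here are hypothetical, not from this PR): position IDs restart at each sample boundary, and a block-diagonal attention mask built from per-sample lengths prevents tokens from attending across boundaries.

```python
# Hypothetical sketch of packing with per-sample isolation.
# Samples are pre-tokenized lists of token IDs.

def pack_samples(samples, pad_id=0, max_len=16):
    input_ids, position_ids, lengths = [], [], []
    for sample in samples:
        if len(input_ids) + len(sample) > max_len:
            break
        input_ids.extend(sample)
        position_ids.extend(range(len(sample)))  # positions restart per sample
        lengths.append(len(sample))
    pad = max_len - len(input_ids)               # pad the remainder
    input_ids.extend([pad_id] * pad)
    position_ids.extend([0] * pad)
    return input_ids, position_ids, lengths

def block_diagonal_mask(lengths, max_len):
    # mask[i][j] is True only if tokens i and j belong to the same sample
    mask = [[False] * max_len for _ in range(max_len)]
    start = 0
    for n in lengths:
        for i in range(start, start + n):
            for j in range(start, start + n):
                mask[i][j] = True
        start += n
    return mask

ids, pos, lens = pack_samples([[5, 6, 7], [8, 9], [1, 2, 3, 4]], max_len=12)
# ids == [5, 6, 7, 8, 9, 1, 2, 3, 4, 0, 0, 0]
# pos == [0, 1, 2, 0, 1, 0, 1, 2, 3, 0, 0, 0]
mask = block_diagonal_mask(lens, 12)
```

Comparing trained-model metrics with and without packing, as asked above, is a good end-to-end check that this isolation actually holds.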
```python
    action="store_true",
    default=cfg.get("enable_gradient_checkpointing", False),
)
parser.add_argument("--packing", type=bool, default=cfg.get("packing", True))
```
When `type=bool` is used, it doesn't work as expected: argparse passes the raw command-line string to `bool()`, and any non-empty string (including `"False"`) is interpreted as `True`.
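To make the pitfall concrete, here is a small sketch with two common fixes: a `store_true`/`store_false` flag pair, and an explicit string-to-bool converter. The flag names mirror the PR's `--packing` argument, but the converter is an illustrative helper, not code from this repository.

```python
import argparse

# The pitfall: bool() on any non-empty string is True.
print(bool("False"))  # True

# Fix 1: paired boolean flags instead of type=bool.
parser = argparse.ArgumentParser()
parser.add_argument("--packing", action="store_true", default=False)
parser.add_argument("--no-packing", dest="packing", action="store_false")

# Fix 2: an explicit converter that parses common truthy/falsy strings.
def str2bool(v):
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "1"):
        return True
    if v.lower() in ("no", "false", "f", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {v!r}")

parser2 = argparse.ArgumentParser()
parser2.add_argument("--packing", type=str2bool, default=True)

print(parser2.parse_args(["--packing", "false"]).packing)  # False
```

On Python 3.9+, `argparse.BooleanOptionalAction` offers the same flag-pair behavior as Fix 1 in a single `add_argument` call.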
```python
def _is_dist():
    return dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1


class SFTDataset(Dataset):
```
Here are three suggestions for the sequence packing feature:
1. Encapsulate the logic into a `PackingMixin` that can be optionally inherited by `SFTDataset`, `PretrainDataset`, etc.
2. If it's only needed for SFT, create a new class that inherits from `SFTDataset` and overrides the relevant methods.
3. Alternatively, you could introduce a new `PackingSFTDataset` that inherits from both `PackingMixin` and `SFTDataset`; this would be a clean and elegant way to compose the functionality without modifying existing code.
Thanks for the review! I have tested the performance of Qwen2.5-0.5B-Instruct using the following configuration: I validated the performance using lm-eval and obtained the following results. Compared to not using packing, there is a slight improvement in model performance and a significant reduction in training time when packing is enabled. I will now turn to implementing the packing feature via inheritance. Thanks for reviewing!
